
    Understanding Cityscapes: Efficient Urban Semantic Scene Understanding

    Semantic scene understanding plays a prominent role in the environment perception of autonomous vehicles. The car needs to be aware of the semantics of its surroundings. In particular, it needs to sense other vehicles, bicycles, or pedestrians in order to predict their behavior. Knowledge of the drivable space is required for safe navigation, and landmarks such as poles, or static infrastructure such as buildings, form the basis for precise localization. In this work, we focus on visual scene understanding, since cameras offer great potential for perceiving semantics while being comparatively cheap; we also focus on urban scenarios, as fully autonomous vehicles are expected to appear first in inner-city traffic. However, this task also comes with significant challenges. While images are rich in information, the semantics are not readily available and need to be extracted by means of computer vision, typically via machine learning methods. Furthermore, modern cameras have high-resolution sensors, as needed for long sensing ranges. As a consequence, large amounts of data need to be processed, while the processing simultaneously requires real-time speeds with low latency. In addition, the resulting semantic environment representation needs to be compressed to allow for fast transmission and downstream processing. Additional challenges for the perception system arise from the scene type, as urban scenes are typically highly cluttered, containing many objects at various scales that are often significantly occluded. In this dissertation, we address efficient urban semantic scene understanding for autonomous driving from three major perspectives. First, we analyze the potential of exploiting multiple input modalities, such as depth, motion, or object detectors, for semantic labeling, as these cues are typically available in autonomous vehicles. Our goal is to integrate such data holistically throughout all processing stages, and we show that our system outperforms comparable baseline methods, which confirms the value of multiple input modalities. Second, we aim to leverage modern deep learning methods, which require large amounts of supervised training data, for street scene understanding. To this end, we introduce Cityscapes, the first large-scale dataset and benchmark for urban scene understanding in terms of pixel- and instance-level semantic labeling. Based on this work, we compare various deep learning methods in terms of their performance on inner-city scenarios facing the challenges introduced above. Leveraging these insights, we combine suitable methods to obtain a real-time-capable neural network for pixel-level semantic labeling with high classification accuracy. Third, we combine our previous results and aim for an integration of depth data from stereo vision and semantic information from deep learning methods by means of the Stixel World (Pfeiffer and Franke, 2011). To this end, we reformulate the Stixel World as a graphical model that provides a clear formalism, based on which we extend the formulation to multiple input modalities. We obtain a compact representation of the environment at real-time speeds that carries semantic as well as 3D information.
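
    To make the final point concrete, below is a minimal, illustrative sketch (not code from the dissertation) of the kind of compact, column-wise representation the semantic Stixel World yields: each image column is summarized by a few vertical segments that carry 3D (disparity) and semantic information instead of dense per-pixel values. All field names are assumptions made for illustration.

```python
# A minimal sketch, assuming a simplified stixel layout; not the dissertation's code.
from dataclasses import dataclass
from typing import List

@dataclass
class SemanticStixel:
    column: int        # image column (or column band) index
    v_top: int         # top image row of the segment
    v_bottom: int      # bottom image row of the segment
    disparity: float   # representative stereo disparity -> 3D distance
    label: str         # semantic class, e.g. "road", "car", "building"

def compression_ratio(stixels: List[SemanticStixel], height: int, width: int) -> float:
    """Rough ratio of per-pixel values to per-stixel values for one image."""
    per_pixel_values = height * width * 2     # e.g. disparity + label for every pixel
    per_stixel_values = len(stixels) * 5      # five attributes per stixel
    return per_pixel_values / max(per_stixel_values, 1)

# Illustration: a 1024x2048 image summarized by ~2000 stixels is compressed by roughly 400x.
```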

    Cityscapes 3D: Dataset and Benchmark for 9 DoF Vehicle Detection

    Detecting vehicles and representing their position and orientation in three-dimensional space is a key technology for autonomous driving. Recently, methods for 3D vehicle detection based solely on monocular RGB images have gained popularity. In order to facilitate this task as well as to compare and drive state-of-the-art methods, several new datasets and benchmarks have been published. Ground-truth annotations of vehicles are usually obtained from lidar point clouds, which often induces errors due to imperfect calibration or synchronization between the two sensors. To address this, we propose Cityscapes 3D, which extends the original Cityscapes dataset with 3D bounding box annotations for all types of vehicles. In contrast to existing datasets, our 3D annotations were labeled using stereo RGB images only and capture all nine degrees of freedom. This leads to a pixel-accurate reprojection into the RGB image and a longer annotation range compared to lidar-based approaches. In order to ease multitask learning, we provide a pairing of 2D instance segments with 3D bounding boxes. In addition, we complement the Cityscapes benchmark suite with 3D vehicle detection based on the new annotations, as well as the metrics presented in this work. Dataset and benchmark are available online. Comment: 2020 "Scalability in Autonomous Driving" CVPR Workshop.
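
    As a rough illustration of what a nine-degrees-of-freedom vehicle annotation paired with a 2D instance segment might look like, here is a hedged sketch; the field names are assumptions and do not reflect the actual Cityscapes 3D annotation schema.

```python
# A hedged sketch of a 9-DoF vehicle box record: 3D position, 3D dimensions,
# and full 3D orientation, linked to a 2D instance segment for multitask learning.
from dataclasses import dataclass

@dataclass
class VehicleBox3D:
    # Box center in the camera/vehicle frame (metres).
    x: float
    y: float
    z: float
    # Box dimensions (metres).
    length: float
    width: float
    height: float
    # Full 3D orientation (radians).
    yaw: float
    pitch: float
    roll: float
    # Identifier of the matching 2D instance segment.
    instance_id: int
```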

    3DMOTFormer: Graph Transformer for Online 3D Multi-Object Tracking

    Tracking 3D objects accurately and consistently is crucial for autonomous vehicles, enabling more reliable downstream tasks such as trajectory prediction and motion planning. Building on the substantial progress in object detection in recent years, the tracking-by-detection paradigm has become a popular choice due to its simplicity and efficiency. State-of-the-art 3D multi-object tracking (MOT) approaches typically rely on non-learned, model-based algorithms such as the Kalman filter, but require many manually tuned parameters. Learning-based approaches, on the other hand, face the problem of adapting training to the online setting, leading to an inevitable distribution mismatch between training and inference as well as suboptimal performance. In this work, we propose 3DMOTFormer, a learned, geometry-based 3D MOT framework built upon the transformer architecture. We use an edge-augmented graph transformer to reason over the track-detection bipartite graph frame by frame and perform data association via edge classification. To reduce the distribution mismatch between training and inference, we propose a novel online training strategy with an autoregressive and recurrent forward pass as well as sequential batch optimization. Using CenterPoint detections, our approach achieves 71.2% and 68.2% AMOTA on the nuScenes validation and test splits, respectively. In addition, a trained 3DMOTFormer model generalizes well across different object detectors. Code is available at: https://github.com/dsx0511/3DMOTFormer. Comment: 17 pages, 8 figures, accepted at ICCV 2023.
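
    The sketch below illustrates the generic association step in tracking-by-detection on a track-detection bipartite graph: each (track, detection) edge carries a score, and matching reduces to selecting high-scoring edges one-to-one. In 3DMOTFormer the scores come from edge classification by the graph transformer; here they are simply given as input, and the greedy matcher is a stand-in for illustration, not the paper's method.

```python
# Generic greedy data association from edge scores; a sketch, not 3DMOTFormer's code.
from typing import Dict, List, Tuple

def associate(edge_scores: Dict[Tuple[int, int], float],
              threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Greedy one-to-one matching from {(track_id, detection_id): score}."""
    matches: List[Tuple[int, int]] = []
    used_tracks, used_dets = set(), set()
    # Consider the highest-scoring edges first.
    for (track_id, det_id), score in sorted(edge_scores.items(),
                                            key=lambda kv: kv[1], reverse=True):
        if score < threshold:
            break
        if track_id in used_tracks or det_id in used_dets:
            continue
        matches.append((track_id, det_id))
        used_tracks.add(track_id)
        used_dets.add(det_id)
    return matches

# Unmatched detections would spawn new tracks; unmatched tracks age and are
# eventually terminated.
```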

    The Cityscapes Dataset for Semantic Urban Scene Understanding

    Visual understanding of complex urban street scenes is an enabling factor for a wide range of applications. Object detection has benefited enormously from large-scale datasets, especially in the context of deep learning. For semantic urban scene understanding, however, no current dataset adequately captures the complexity of real-world urban scenes. To address this, we introduce Cityscapes, a benchmark suite and large-scale dataset to train and test approaches for pixel-level and instance-level semantic labeling. Cityscapes comprises a large, diverse set of stereo video sequences recorded in the streets of 50 different cities. 5000 of these images have high-quality pixel-level annotations; 20000 additional images have coarse annotations to enable methods that leverage large volumes of weakly labeled data. Crucially, our effort exceeds previous attempts in terms of dataset size, annotation richness, scene variability, and complexity. Our accompanying empirical study provides an in-depth analysis of the dataset characteristics, as well as a performance evaluation of several state-of-the-art approaches based on our benchmark. Comment: Includes supplemental material.
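
    For context on how pixel-level semantic labeling is commonly evaluated on benchmarks such as Cityscapes, the sketch below computes per-class intersection-over-union from a confusion matrix and averages it into mean IoU; it illustrates the metric only and is not the official evaluation code.

```python
# Mean IoU over integer label maps; a minimal sketch of the standard metric.
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int,
             ignore_label: int = 255) -> float:
    """pred, gt: integer label maps of the same shape."""
    mask = gt != ignore_label
    # Confusion matrix: rows = ground-truth class, columns = predicted class.
    conf = np.bincount(gt[mask] * num_classes + pred[mask],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    denom = tp + fp + fn
    valid = denom > 0            # skip classes absent from both prediction and ground truth
    return float((tp[valid] / denom[valid]).mean())
```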

    Coarse-to-Fine Annotation Enrichment for Semantic Segmentation Learning

    Rich, high-quality annotated data is critical for semantic segmentation learning, yet acquiring dense, pixel-wise ground truth is both labor- and time-consuming. Coarse annotations (e.g., scribbles, coarse polygons) offer an economical alternative, but training on them alone rarely yields satisfactory performance. In order to generate high-quality annotated data at low time cost for accurate segmentation, we propose in this paper a novel annotation enrichment strategy that expands existing coarse annotations of the training data to a finer scale. Extensive experiments on the Cityscapes and PASCAL VOC 2012 benchmarks show that neural networks trained with the enriched annotations from our framework yield a significant improvement over those trained with the original coarse labels, and are highly competitive with the performance obtained using dense human annotations. The proposed method also outperforms other state-of-the-art weakly supervised segmentation methods. Comment: CIKM 2018, International Conference on Information and Knowledge Management.

    RGB2LIDAR: Towards Solving Large-Scale Cross-Modal Visual Localization

    We study an important, yet largely unexplored, problem of large-scale cross-modal visual localization: matching ground RGB images to a geo-referenced aerial LIDAR 3D point cloud (rendered as depth images). Prior works were demonstrated on small datasets and did not lend themselves to scaling up for large-scale applications. To enable large-scale evaluation, we introduce a new dataset containing over 550K pairs (covering a 143 km^2 area) of RGB and aerial LIDAR depth images. We propose a novel joint-embedding-based method that effectively combines appearance and semantic cues from both modalities to handle drastic cross-modal variations. Experiments on the proposed dataset show that our model achieves a strong median rank of 5 when matching across a large test set of 50K location pairs collected from a 14 km^2 area. This represents a significant advancement over prior works in both performance and scale. We conclude with qualitative results that highlight the challenging nature of this task and the benefits of the proposed model. Our work provides a foundation for further research in cross-modal visual localization. Comment: ACM Multimedia 2020.
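
    The snippet below is a hedged sketch of the retrieval protocol implied by the reported median-rank result: query (ground RGB) and reference (rendered LIDAR depth) embeddings from a joint-embedding model are compared by cosine similarity, and the rank of each query's true match is recorded. The embedding model itself is assumed to exist elsewhere and is not shown.

```python
# Median rank of cross-modal retrieval from precomputed embeddings; a sketch only.
import numpy as np

def median_rank(query_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """query_emb, ref_emb: (N, D) arrays where query i matches reference i."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = q @ r.T                                    # cosine similarity matrix
    order = np.argsort(-sim, axis=1)                 # references sorted best-first per query
    # Position of the correct reference for each query (1 = best).
    ranks = np.argmax(order == np.arange(len(q))[:, None], axis=1) + 1
    return float(np.median(ranks))
```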
